The topic of this capstone project is Michelin restaurant ratings. With so many restaurants today, it is increasingly difficult to stand out from the crowd. This project is intended for restaurant owners who want to earn their first, second, or third star (three groups). To help them reach that goal, we will first study each group globally to understand its common characteristics, and then examine the influence of location using the Foursquare API.
During this project, we will use the Michelin Restaurants dataset from Kaggle and the Foursquare API. There are three files, one for each group (one star, two stars, and three stars), and they all have the same column names, shown in the sample below.
We will see later that, to use the Foursquare API correctly, we need neighborhood names. This information is not available in the Kaggle dataset, so we had to scrape a few web pages to obtain it.
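The scraping helpers live in `Michelin_scraping_lib`; as a minimal sketch of the parsing idea (the page layout, function name, and HTML snippet below are hypothetical), a zip-code-to-neighborhood table can be turned into a lookup dictionary like this:

```python
import re

def zip_to_neighborhood(html: str) -> dict:
    """Parse simple <tr><td>zip</td><td>neighborhood</td></tr> rows
    into a {zip_code: neighborhood} lookup."""
    rows = re.findall(
        r"<tr>\s*<td>(\d{5})</td>\s*<td>([^<]+)</td>\s*</tr>", html
    )
    return {zip_code: name.strip() for zip_code, name in rows}

# Hypothetical snippet standing in for a scraped page
sample = """
<table>
  <tr><td>10001</td><td>Chelsea</td></tr>
  <tr><td>10013</td><td>Tribeca</td></tr>
</table>
"""
lookup = zip_to_neighborhood(sample)
```

In practice the page would first be fetched (e.g. with `requests`) before being parsed.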
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import Michelin_aux as al
import Michelin_cleaning_lib as nl
import Michelin_scraping_lib as sl
full_df = nl.full_data()
full_df.head()
|   | name | year | latitude | longitude | city | region | zipCode | cuisine | price | url | Star |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kilian Stuba | 2019 | 47.348580 | 10.17114 | Kleinwalsertal | Austria | 87568 | Creative | $$$$$ | https://guide.michelin.com/at/en/vorarlberg/kl... | 1 |
| 1 | Pfefferschiff | 2019 | 47.837870 | 13.07917 | Hallwang | Austria | 5300 | Classic cuisine | $$$$$ | https://guide.michelin.com/at/en/salzburg-regi... | 1 |
| 2 | Esszimmer | 2019 | 47.806850 | 13.03409 | Salzburg | Austria | 5020 | Creative | $$$$$ | https://guide.michelin.com/at/en/salzburg-regi... | 1 |
| 3 | Carpe Diem | 2019 | 47.800010 | 13.04006 | Salzburg | Austria | 5020 | Market cuisine | $$$$$ | https://guide.michelin.com/at/en/salzburg-regi... | 1 |
| 4 | Edvard | 2019 | 48.216503 | 16.36852 | Wien | Austria | 1010 | Modern cuisine | $$$$ | https://guide.michelin.com/at/en/vienna/wien/r... | 1 |
First, we have to clean our datasets. We start with the price variable: as we can see, we must convert its `$` symbols into categorical values such as "cheap", "mid-cheap", "medium", "mid-expensive" and "expensive". Then we one-hot-encode the variables cuisine, price and region. Finally, we drop all variables that are not useful in a machine learning model. That is the purpose of the pipeline below:
full_df_ML = (
full_df
.pipe(nl.replace_dollar)
.pipe(nl.one_hot_data_pipe)
.pipe(nl.multiple_del_pipe)
)
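The actual transformations live in `Michelin_cleaning_lib`; a minimal sketch of what `replace_dollar` and the one-hot step might look like (the mapping and function bodies below are assumptions, not the library's code) is:

```python
import pandas as pd

# Assumed mapping from Michelin '$' symbols to the categorical labels
PRICE_LABELS = {
    "$": "cheap",
    "$$": "mid-cheap",
    "$$$": "medium",
    "$$$$": "mid-expensive",
    "$$$$$": "expensive",
}

def replace_dollar(df: pd.DataFrame) -> pd.DataFrame:
    """Turn '$'..'$$$$$' into readable price categories."""
    out = df.copy()
    out["price"] = out["price"].map(PRICE_LABELS)
    return out

def one_hot(df: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode the categorical columns used by the model."""
    return pd.get_dummies(df, columns=["cuisine", "price", "region"])

demo = pd.DataFrame({
    "cuisine": ["Creative", "Classic cuisine"],
    "price": ["$$$$$", "$$$$"],
    "region": ["Austria", "Austria"],
})
encoded = demo.pipe(replace_dollar).pipe(one_hot)
```

Chaining the steps with `.pipe` keeps the cleaning readable, mirroring the pipeline above.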
Our data has been cleaned; now we can decide which cities to study. To do this, we will select the cities with the most starred restaurants.
best_city_one, best_city_two, best_city_three = al.best_cities(full_df)
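`al.best_cities` computes this ranking per star group; a minimal sketch of the underlying counting (the function body below is an assumption) could be:

```python
import pandas as pd

def best_cities(df: pd.DataFrame, star: int, n: int = 10) -> pd.Series:
    """Count restaurants per city for one star level, keep the top n."""
    return df.loc[df["Star"] == star, "city"].value_counts().head(n)

# Tiny illustrative frame with the same columns as full_df
demo = pd.DataFrame({
    "city": ["New York", "New York", "Hong Kong", "San Francisco", "New York"],
    "Star": [1, 1, 1, 2, 2],
})
top_one_star = best_cities(demo, star=1)
```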
import plotly.offline as pyo
pyo.init_notebook_mode()
al.fig_best_cities(best_city_one, best_city_two, best_city_three)
This graph shows that New York, Hong Kong and San Francisco have the most starred restaurants, so we will focus on these three cities.
full_df_NY = full_df[full_df["city"]=="New York"].reset_index(drop=True)
al.create_map(full_df_NY, "New York, NY")
full_df_HK = full_df[full_df["city"]=="Hong Kong"].reset_index(drop=True)
al.create_map(full_df_HK, "Hong Kong, HK")
full_df_SF_map = full_df[full_df["city"]=="San Francisco"].reset_index(drop=True)
al.create_map(full_df_SF_map, "San Francisco, SF")
Unfortunately, we will not be able to study Hong Kong: its restaurants have no zip code, so we cannot find their neighborhood names.
We are now able to start the machine learning part. I first tried k-means clustering, but there were not enough rows to fit it, so I used hierarchical clustering instead.
full_df_NY = full_df_ML[full_df_ML["city"]=="New York"].reset_index(drop=True)
full_df_NY = sl.scraping_NY(full_df_NY)
full_df_NY = full_df_NY.dropna()
NY_venues = sl.getNearbyVenues(names=full_df_NY['Neighborhood'],
latitudes=full_df_NY['latitude'],
longitudes=full_df_NY['longitude'])
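`sl.getNearbyVenues` queries the Foursquare `venues/explore` endpoint for each restaurant location. As a sketch of how such a request is typically built (the function name, default radius and limit below are assumptions), without actually hitting the network:

```python
from urllib.parse import urlencode

FOURSQUARE_EXPLORE = "https://api.foursquare.com/v2/venues/explore"

def explore_url(lat, lon, client_id, client_secret,
                version="20190425", radius=500, limit=100):
    """Build a venues/explore request URL for one location.
    The real call would then fetch it, e.g. requests.get(url).json()."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lon}",
        "radius": radius,
        "limit": limit,
    }
    return f"{FOURSQUARE_EXPLORE}?{urlencode(params)}"

url = explore_url(40.7421, -73.9914, "CLIENT_ID", "CLIENT_SECRET")
```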
NY_grouped = nl.clean_group_venues(NY_venues)
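`nl.clean_group_venues` prepares the venue list for clustering; presumably it one-hot encodes venue categories and averages them per neighborhood, which a minimal stand-in (not the library's actual code) could do like this:

```python
import pandas as pd

def clean_group_venues(venues: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode venue categories, then take the mean frequency
    of each category per neighborhood."""
    onehot = pd.get_dummies(venues["Venue Category"])
    onehot.insert(0, "Neighborhood", venues["Neighborhood"])
    return onehot.groupby("Neighborhood").mean().reset_index()

# Tiny illustrative venue list
demo = pd.DataFrame({
    "Neighborhood": ["Chelsea", "Chelsea", "Tribeca"],
    "Venue Category": ["Coffee Shop", "Bakery", "Coffee Shop"],
})
grouped = clean_group_venues(demo)
```

Each row then describes a neighborhood as a vector of venue-category frequencies, which is what the clustering consumes.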
neighborhoods_venues_sorted = al.df_top_venues(NY_grouped, 10)
NY_grouped_clustering = NY_grouped.drop('Neighborhood', axis=1)
NY_grouped_clustering
al.dendo(NY_grouped_clustering)
silhouette_NY = al.silhouette(NY_grouped_clustering)
kvals = np.arange(2, 11, 1)
al.silhouette_graph(NY_grouped_clustering, kvals)
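Choosing k from the dendrogram and silhouette scores can be sketched as follows: cut a Ward linkage at each candidate k and keep the k with the best silhouette (the helper below is an assumption about what `al.silhouette_graph` computes, demonstrated on synthetic data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

def silhouette_by_k(X, kvals):
    """Cut a Ward dendrogram at each k and score the resulting labels."""
    Z = linkage(X, method="ward")
    return {k: silhouette_score(X, fcluster(Z, t=k, criterion="maxclust"))
            for k in kvals}

# Two well-separated synthetic blobs: k = 2 should score highest.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
scores = silhouette_by_k(X, range(2, 6))
best_k = max(scores, key=scores.get)
```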
According to the dendrogram and the silhouette metric, the optimal number of clusters for New York is three, so we will build the model with k = 3.
NY_merged = al.Add_predict_vector(full_df_NY, neighborhoods_venues_sorted, al.HC_predict(NY_grouped_clustering, 3))
al.create_map_clusters(NY_merged, "New York, NY", 3, zoom=12)
full_df_SF = full_df[full_df["city"]=="San Francisco"].reset_index(drop=True)
full_df_SF["Neighborhood"] = ""
full_df_SF = sl.scraping_SF(full_df_SF)
full_df_SF = full_df_SF.dropna()
sf_venues = sl.getNearbyVenues(names=full_df_SF['Neighborhood'],
latitudes=full_df_SF['latitude'],
longitudes=full_df_SF['longitude'])
sf_grouped = nl.clean_group_venues(sf_venues)
neighborhoods_venues_sorted_SF = al.df_top_venues(sf_grouped, 10)
sf_grouped_clustering = sf_grouped.drop('Neighborhood', axis=1)
al.dendo(sf_grouped_clustering)
kvals = np.arange(2, 11, 1)
al.silhouette_graph(sf_grouped_clustering, kvals)
According to the dendrogram and the silhouette metric, the optimal number of clusters is two, so we will build the model with k = 2.
silhouette_SF = al.silhouette(sf_grouped_clustering)
sf_merged = al.Add_predict_vector(full_df_SF, neighborhoods_venues_sorted_SF, al.HC_predict(sf_grouped_clustering, 2))
al.create_map_clusters(sf_merged, "San Francisco, SF", 2, zoom=12)
The results of hierarchical clustering show that we can group the neighborhoods into three clusters for New York and two for San Francisco:
For New York, we learned that there is a high concentration of starred restaurants inside Manhattan, and therefore strong competition. The same problem applies to the north of San Francisco: rents are higher and it is harder to enter the market. Opening outside Manhattan, or in the south of San Francisco, could be a better option. As a starred restaurant, you want the best for your customers; prices are lower there, and so is the competition, so entering the market will be easier. The choice depends on your initial investment.
For future research, we could study more cities to learn how they differ. We could also deepen the study of each city, for example by adding the price per square foot of each neighborhood, and perhaps build a dashboard.
In this project, we went through the process of identifying the business problem, specifying the data required, extracting and preparing the data, and performing machine learning by clustering the data into three clusters for New York and two for San Francisco, which helped us better understand these markets.